
    PASTA: Pretrained Action-State Transformer Agents

    Self-supervised learning has brought about a revolutionary paradigm shift in various computing domains, including NLP, vision, and biology. Recent approaches involve pre-training transformer models on vast amounts of unlabeled data, serving as a starting point for efficiently solving downstream tasks. In the realm of reinforcement learning, researchers have recently adapted these approaches by developing models pre-trained on expert trajectories, enabling them to address a wide range of tasks, from robotics to recommendation systems. However, existing methods mostly rely on intricate pre-training objectives tailored to specific downstream applications. This paper presents a comprehensive investigation of models we refer to as Pretrained Action-State Transformer Agents (PASTA). Our study uses a unified methodology and covers an extensive set of general downstream tasks, including behavioral cloning, offline RL, sensor-failure robustness, and dynamics-change adaptation. Our goal is to systematically compare various design choices and provide valuable insights to practitioners for building robust models. Key highlights of our study include tokenization at the action and state component level, the use of fundamental pre-training objectives such as next-token prediction, training models across diverse domains simultaneously, and parameter-efficient fine-tuning (PEFT). The models developed in our study contain fewer than 10 million parameters, and PEFT enables fine-tuning of fewer than 10,000 parameters during downstream adaptation, allowing a broad community to use these models and reproduce our experiments. We hope that this study will encourage further research into the use of transformers with first-principles design choices to represent RL trajectories and contribute to robust policy learning.
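    The component-level tokenization highlighted above can be made concrete. A minimal sketch, assuming uniform binning of each scalar state/action component; the function name, bin count, and value ranges are illustrative choices, not the paper's:

    ```python
    import numpy as np

    def tokenize_trajectory(states, actions, n_bins=64, low=-1.0, high=1.0):
        """Discretize each state/action *component* into its own token and
        interleave them as (s_1, ..., s_d, a_1, ..., a_k) per timestep."""
        tokens = []
        for s, a in zip(states, actions):
            for x in np.concatenate([s, a]):
                # Map the scalar component to one of n_bins uniform bins.
                b = int(np.clip((x - low) / (high - low) * n_bins, 0, n_bins - 1))
                tokens.append(b)
        return tokens

    # Next-token prediction then reduces to ordinary language-model
    # pre-training: feed tokens[:-1] into a small transformer and fit
    # tokens[1:] with a cross-entropy loss.
    ```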

    Assessing Quality-Diversity Neuro-Evolution Algorithms Performance in Hard Exploration Problems

    A fascinating aspect of nature lies in its ability to produce a collection of organisms that are all high-performing in their niche. Quality-Diversity (QD) methods are evolutionary algorithms inspired by this observation that have obtained great results in many applications, from wing design to robot adaptation. Recently, several works demonstrated that these methods can be applied to perform neuro-evolution to solve control problems in large search spaces. In such problems, diversity can be a target in itself. Diversity can also be a way to enhance exploration in tasks exhibiting deceptive reward signals. While the first aspect has been studied in depth by the QD community, the latter remains scarce in the literature. Exploration is at the heart of several fields that tackle control problems, such as Reinforcement Learning, and QD methods are promising candidates for overcoming the associated challenges. Therefore, we believe that standardized benchmarks exhibiting high-dimensional control problems with exploration difficulties are of interest to the QD community. In this paper, we highlight three candidate benchmarks and explain why they appear relevant for the systematic evaluation of QD algorithms. We also provide open-source implementations in JAX, allowing practitioners to run fast and numerous experiments with limited compute resources.
    Comment: GECCO 2022 Workshop on Quality Diversity Algorithm Benchmark
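    For readers new to QD, the loop these benchmarks exercise is compact. Below is a minimal plain-Python sketch of MAP-Elites, the canonical QD algorithm, with a one-dimensional behavior descriptor for brevity; the authors' JAX implementations are far more capable, and all names here are illustrative:

    ```python
    import numpy as np

    def map_elites(evaluate, n_params, n_cells=50, iters=10_000, sigma=0.1, rng=None):
        """Minimal MAP-Elites loop: keep the best solution per behavior cell.

        `evaluate(theta) -> (fitness, behavior)` where behavior is a scalar
        in [0, 1]; real QD benchmarks use multi-dimensional descriptors.
        """
        if rng is None:
            rng = np.random.default_rng(0)
        archive = {}  # cell index -> (fitness, parameters)
        for _ in range(iters):
            if archive:
                # Mutate a random elite from the archive.
                _, parent = archive[rng.choice(list(archive))]
                theta = parent + sigma * rng.standard_normal(n_params)
            else:
                theta = rng.standard_normal(n_params)
            fit, beh = evaluate(theta)
            cell = min(int(beh * n_cells), n_cells - 1)
            # Replace the cell's occupant only if the newcomer is fitter.
            if cell not in archive or fit > archive[cell][0]:
                archive[cell] = (fit, theta)
        return archive
    ```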

    Neuroevolution is a Competitive Alternative to Reinforcement Learning for Skill Discovery

    Deep Reinforcement Learning (RL) has emerged as a powerful paradigm for training neural policies to solve complex control tasks. However, these policies tend to overfit to the exact specifications of the task and environment they were trained on, and thus do not perform well when conditions deviate slightly or when composed hierarchically to solve even more complex tasks. Recent work has shown that training a mixture of policies, as opposed to a single one, that are driven to explore different regions of the state-action space can address this shortcoming by generating a diverse set of behaviors, referred to as skills, that can be collectively used to great effect in adaptation tasks or for hierarchical planning. This is typically realized by including a diversity term, often derived from information theory, in the objective function optimized by RL. However, these approaches often require careful hyperparameter tuning to be effective. In this work, we demonstrate that less widely used neuroevolution methods, specifically Quality Diversity (QD), are a competitive alternative to information-theory-augmented RL for skill discovery. Through an extensive empirical evaluation comparing eight state-of-the-art methods on the basis of (i) metrics directly evaluating the skills' diversity, (ii) the skills' performance on adaptation tasks, and (iii) the skills' performance when used as primitives for hierarchical planning, QD methods are found to provide equal, and sometimes improved, performance whilst being less sensitive to hyperparameters and more scalable. As no single method is found to provide near-optimal performance across all environments, there is rich scope for further research, which we support by proposing future directions and providing optimized open-source implementations.
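    The information-theoretic diversity term mentioned above typically rewards a skill for visiting states that a learned discriminator can attribute to it. A minimal sketch in the style of DIAYN with a uniform prior over skills; this illustrates the general recipe, not any one of the eight evaluated methods:

    ```python
    import numpy as np

    def diversity_reward(disc_logprobs, skill_id, n_skills):
        """DIAYN-style intrinsic reward: log q(z | s) - log p(z).

        `disc_logprobs` holds the discriminator's log-probabilities over
        skills for the current state; a skill earns reward when its states
        are easy to tell apart from the other skills' states. The prior
        p(z) is assumed uniform here.
        """
        return disc_logprobs[skill_id] - np.log(1.0 / n_skills)
    ```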

    Guidelines for the use and interpretation of assays for monitoring autophagy (3rd edition)

    In 2008 we published the first set of guidelines for standardizing research in autophagy. Since then, research on this topic has continued to accelerate, and many new scientists have entered the field. Our knowledge base and relevant new technologies have also been expanding. Accordingly, it is important to update these guidelines for monitoring autophagy in different organisms. Various reviews have described the range of assays that have been used for this purpose. Nevertheless, there continues to be confusion regarding acceptable methods to measure autophagy, especially in multicellular eukaryotes. For example, a key point that needs to be emphasized is that there is a difference between measurements that monitor the numbers or volume of autophagic elements (e.g., autophagosomes or autolysosomes) at any stage of the autophagic process versus those that measure flux through the autophagy pathway (i.e., the complete process including the amount and rate of cargo sequestered and degraded). In particular, a block in macroautophagy that results in autophagosome accumulation must be differentiated from stimuli that increase autophagic activity, defined as increased autophagy induction coupled with increased delivery to, and degradation within, lysosomes (in most higher eukaryotes and some protists such as Dictyostelium) or the vacuole (in plants and fungi). In other words, it is especially important that investigators new to the field understand that the appearance of more autophagosomes does not necessarily equate with more autophagy. In fact, in many cases, autophagosomes accumulate because of a block in trafficking to lysosomes without a concomitant change in autophagosome biogenesis, whereas an increase in autolysosomes may reflect a reduction in degradative activity. It is worth emphasizing here that lysosomal digestion is a stage of autophagy and evaluating its competence is a crucial part of the evaluation of autophagic flux, or complete autophagy. Here, we present a set of guidelines for the selection and interpretation of methods for use by investigators who aim to examine macroautophagy and related processes, as well as for reviewers who need to provide realistic and reasonable critiques of papers that are focused on these processes. These guidelines are not meant to be a formulaic set of rules, because the appropriate assays depend in part on the question being asked and the system being used. In addition, we emphasize that no individual assay is guaranteed to be the most appropriate one in every situation, and we strongly recommend the use of multiple assays to monitor autophagy. Along these lines, because of the potential for pleiotropic effects due to blocking autophagy through genetic manipulation, it is imperative to delete or knock down more than one autophagy-related gene. In addition, some individual Atg proteins, or groups of proteins, are involved in other cellular pathways, so not all Atg proteins can be used as specific markers for an autophagic process. In these guidelines, we consider these various methods of assessing autophagy and what information can, or cannot, be obtained from them. Finally, by discussing the merits and limits of particular autophagy assays, we hope to encourage technical innovation in the field.

    Adaptive optimization problems under uncertainty with limited feedback

    Thesis: Ph.D., Massachusetts Institute of Technology, Sloan School of Management, Operations Research Center, 2017. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Cataloged from the student-submitted PDF version of the thesis. Includes bibliographical references (pages 159-166).
    This thesis is concerned with the design and analysis of new algorithms for sequential optimization problems with limited feedback on the outcomes of alternatives, when the environment is not perfectly known in advance and may react to past decisions. Depending on the setting, we take either a worst-case approach, which protects against a fully adversarial environment, or a hindsight approach, which adapts to the level of adversariality by measuring performance in terms of a quantity known as regret. First, we study stochastic shortest path problems with a deadline imposed at the destination, when the objective is to minimize a risk function of the lateness. To capture distributional ambiguity, we assume that the arc travel times are only known through confidence intervals on some statistics, and we design efficient algorithms minimizing the worst-case risk function. Second, we study the minimax achievable regret in the online convex optimization framework when the loss function is piecewise linear. We show that the curvature of the decision maker's decision set has a major impact on the growth rate of the minimax regret with respect to the time horizon. Specifically, the rate is always square root when the set is a polyhedron, while it can be logarithmic when the set is strongly curved. Third, we study the Bandits with Knapsacks framework, a recent extension of the standard Multi-Armed Bandit framework capturing resource consumption. We extend the methodology developed for the original problem and design algorithms with regret bounds that are logarithmic in the initial endowments of resources in several important cases that cover many practical applications, such as bid optimization in online advertising auctions. Fourth, we study more specifically the problem of repeated bidding in online advertising auctions when some side information (e.g., browser cookies) is available ahead of submitting a bid. Optimizing the bids is modeled as a contextual Bandits with Knapsacks problem with a continuum of arms. We design efficient algorithms with regret bounds that scale as the square root of the initial budget.
    by Arthur Flajolet. Ph.D.
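    Regret, the performance measure used throughout the thesis, compares the decisions made online against the best fixed decision in hindsight; for decisions x_1, ..., x_T and loss functions f_1, ..., f_T it reads:

    ```latex
    % Regret against the best fixed decision in hindsight:
    R_T = \sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x)
    ```

    The square-root versus logarithmic rates discussed in the second part refer to how R_T grows with the horizon T.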

    Real-Time Bidding with Side Information

    We consider the problem of repeated bidding in online advertising auctions when some side information (e.g., browser cookies) is available ahead of submitting a bid, in the form of a d-dimensional vector. The goal for the advertiser is to maximize the total utility (e.g., the total number of clicks) derived from displaying ads, given that a limited budget B is allocated for a given time horizon T. Optimizing the bids is modeled as a contextual Multi-Armed Bandit (MAB) problem with a knapsack constraint and a continuum of arms. We develop UCB-type algorithms that combine two streams of literature: the confidence-set approach to linear contextual MABs and the probabilistic bisection search method for stochastic root-finding. Under mild assumptions on the underlying unknown distribution, we establish distribution-independent regret bounds of order Õ(d · √T) when either B = ∞ or B scales linearly with T.
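    Of the two ingredients, the linear confidence-set component is the simpler to sketch. A minimal LinUCB-style estimator follows; the paper's actual algorithm additionally paces the budget via probabilistic bisection, which is omitted here, and all names are illustrative:

    ```python
    import numpy as np

    class LinUCB:
        """Confidence-set approach to linear contextual bandits: estimate
        expected utility as <theta, x> and act optimistically by adding a
        confidence width to the estimate."""

        def __init__(self, d, lam=1.0, alpha=1.0):
            self.A = lam * np.eye(d)   # regularized Gram matrix
            self.b = np.zeros(d)       # accumulated reward-weighted contexts
            self.alpha = alpha         # confidence-width multiplier

        def ucb(self, x):
            """Optimistic utility estimate for context/bid feature x."""
            A_inv = np.linalg.inv(self.A)
            theta = A_inv @ self.b
            width = self.alpha * np.sqrt(x @ A_inv @ x)
            return theta @ x + width

        def update(self, x, reward):
            """Fold the observed (context, reward) pair into the estimates."""
            self.A += np.outer(x, x)
            self.b += reward * x
    ```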

    Robust Adaptive Routing Under Uncertainty

    We consider the problem of finding an optimal history-dependent routing strategy on a directed graph weighted by stochastic arc costs when the objective is to minimize the risk of spending more than a prescribed budget. To help mitigate the impact of the lack of information on the arc cost probability distributions, we introduce a robust counterpart where the distributions are only known through confidence intervals on some statistics, such as the mean, the mean absolute deviation, and any quantile. Leveraging recent results in distributionally robust optimization, we develop a general-purpose algorithm to compute an approximate optimal strategy. To illustrate the benefits of the robust approach, we run numerical experiments with field data from the Singapore road network.
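    Schematically, the robust counterpart replaces the single unknown distribution with an ambiguity set built from the confidence intervals; this is a sketch of the general shape of the objective, not the paper's exact formulation:

    ```latex
    % Guard against every distribution P whose statistics (mean, mean
    % absolute deviation, quantiles) fall inside the confidence intervals:
    \min_{\pi} \; \max_{P \in \mathcal{P}} \; \mathbb{P}_P\big(\mathrm{cost}(\pi) > \mathrm{budget}\big)
    ```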

    Online learning with a hint

    We study a variant of online linear optimization where the player receives a hint about the loss function at the beginning of each round. The hint is given in the form of a vector that is weakly correlated with the loss vector on that round. We show that the player can benefit from such a hint if the set of feasible actions is sufficiently round. Specifically, if the set is strongly convex, the hint can be used to guarantee a regret of O(log(T)), and if the set is q-uniformly convex for q ∈ (2, 3), the hint can be used to guarantee a regret of o(√T). In contrast, we establish Ω(√T) lower bounds on regret when the set of feasible actions is a polyhedron.
    Office of Naval Research (Grant N00014-15-1-2083)
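    To see why roundness helps, note that linear minimization over a strongly convex set such as the Euclidean ball moves smoothly with its objective, so a hint can stabilize the iterates. A simplified one-round sketch of a follow-the-leader-style rule on the ball; this illustrates the curvature intuition, not the paper's algorithm:

    ```python
    import numpy as np

    def play_with_hint(loss_sum, hint, radius=1.0):
        """One round on the Euclidean ball (a strongly convex set):
        minimize the linear proxy (past losses + current hint)."""
        direction = loss_sum + hint
        norm = np.linalg.norm(direction)
        if norm == 0.0:
            return np.zeros_like(direction)
        # Linear minimization over the ball has a closed form: the point
        # on the boundary opposite the objective direction.
        return -radius * direction / norm
    ```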